Section 1.2: Classifying and Storing Data Sources
Population
- A target group we want to study.
- The collection of all data from that group.
- It is often difficult (if not impossible) to obtain all data from the population.
Sample
- A subset of the population.
- Should represent the population as a whole.
- It is typically easier (and sometimes only possible) to collect data from a sample.
Suppose you want to know what the predominant eye color in your country is. You survey a random sample of 2,500 people in your country, asking them about their eye color.
- Who is the population?
- Who is the sample?
- What data were collected?
- The population is everyone in your country.
- The sample is the 2,500 people who were surveyed.
- The data collected were participants’ eye color.
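
The population/sample distinction in the eye-color example can be sketched in a few lines of Python. This is only an illustration: the population below is made-up, randomly generated data (the size of 100,000 and the four color labels are assumptions, not survey results), and `random.sample` stands in for the act of surveying 2,500 people at random.

```python
import random
from collections import Counter

# Hypothetical setup: a simulated "country" of 100,000 people.
# These values are randomly generated for illustration, not real data.
rng = random.Random(0)
colors = ["brown", "blue", "green", "hazel"]
population = [rng.choice(colors) for _ in range(100_000)]

# The sample: 2,500 people drawn at random, without replacement.
sample = rng.sample(population, 2500)

# The data collected: each sampled person's eye color.
sample_counts = Counter(sample)
predominant_in_sample = sample_counts.most_common(1)[0][0]
```

In practice only `sample` would be observed; the full `population` list exists here only because the data are simulated.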
Classifying Data
- A data variable is a characteristic that is measured or recorded.
- There are two types of variables: categorical and numerical.
- A variable is categorical if it describes a quality or a class.
- A categorical variable may use numbers as labels, but arithmetic operations on those numbers are not meaningful.
- Examples: eye color, zip code, letter grade, Social Security number (SSN)
- A variable is numerical if it describes a quantity or a measurement.
- Examples: height, weight, temperature
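
As a rough sketch of why the distinction matters, the Python snippet below treats a numerical variable and a categorical variable differently. The values are made up for illustration. Averaging heights is meaningful, while for categorical data (including zip codes, which look like numbers) counting how often each label occurs is the sensible summary.

```python
from collections import Counter
from statistics import mean

# Made-up example values, for illustration only.
heights_cm = [162.5, 170.0, 181.2]               # numerical: a measurement
eye_colors = ["brown", "blue", "brown", "green"]  # categorical: a class
zip_codes = ["30301", "60614", "30301"]           # categorical: numbers used as labels

# Arithmetic is meaningful for a numerical variable.
average_height = mean(heights_cm)

# For categorical variables we count labels instead of doing arithmetic.
eye_color_counts = Counter(eye_colors)
zip_code_counts = Counter(zip_codes)
```

Storing zip codes as strings rather than integers is one way to signal that arithmetic on them is not intended.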
Classify each of the following variables as numerical or categorical.
| Variable | Numerical | Categorical |
|---|---|---|
| Height of a building | | |
| Letter grade on a test | | |
| Hours of sleep each night | | |
| Students’ GPAs | | |
| Types of cars | | |
| Vegetable varieties planted in a garden | | |
| Number of vegetable varieties planted in a garden | | |
| Variable | Numerical | Categorical |
|---|---|---|
| Height of a building | X | |
| Letter grade on a test | | X |
| Hours of sleep each night | X | |
| Students’ GPAs | X | |
| Types of cars | | X |
| Vegetable varieties planted in a garden | | X |
| Number of vegetable varieties planted in a garden | X | |
Storing Data
Coded data use numbers to represent information, which can make the data easier to record and analyze.
When a variable is binary (i.e., it has only two possible values), we often code it using 0 and 1, where 0 means false and 1 means true.
Suppose a local animal shelter received a litter of five surrendered puppies. A volunteer named each puppy and identified its sex in the table below. The manager wants to count the number of female puppies, so she asked you to add a new column named “Female”. How would you code this new column?
| Name | Sex |
|---|---|
| Daisy | Female |
| Hazel | Female |
| Luna | Female |
| Milo | Male |
| Rocky | Male |
If a puppy is female, assign a value of 1. Otherwise, assign a value of 0.
| Name | Sex | Female |
|---|---|---|
| Daisy | Female | 1 |
| Hazel | Female | 1 |
| Luna | Female | 1 |
| Milo | Male | 0 |
| Rocky | Male | 0 |
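
The coding rule above can be sketched in Python as follows. The variable names (`puppies`, `coded`, `female_count`) are chosen here for illustration. Because females are coded 1 and males 0, summing the new column gives the number of female puppies directly.

```python
# Puppy names and sexes from the table above.
puppies = [
    ("Daisy", "Female"),
    ("Hazel", "Female"),
    ("Luna", "Female"),
    ("Milo", "Male"),
    ("Rocky", "Male"),
]

# Add the coded "Female" column: 1 if the puppy is female, otherwise 0.
coded = [(name, sex, 1 if sex == "Female" else 0) for name, sex in puppies]

# Summing a 0/1 column counts the rows coded as 1.
female_count = sum(female for _, _, female in coded)
```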